Abstract


Goal  management  is  the  process  of  recognizing  or  inferring  goals  of  individual  team 
members;  abandoning  goals  that  are  no  longer  relevant;  identifying  and  resolving 
conflicts  among  goals;  and  prioritizing  goals  consistently  for  optimal  team 
collaboration  and  effective  operations.  A  Markov  decision  process  (MDP)  approach  is 
employed  to  maximize  the  probability  of  achieving  the  primary  goals  (a  subset  of  all 
goals).  We  seek  to  address  the  computational  adequacy  of  an  MDP  as  a  planning 
model  by  introducing  novel  problem  domain-specific  heuristic  evaluation  functions 
(HEF)  to  aid  the  search  process.  We  employ  the  optimal  AO*  search  and  two 
suboptimal  greedy  search  algorithms  to  solve  the  MDP  problem.  A  comparison  of  these 
algorithms  to  the  dynamic  programming  algorithm  shows  that  computational 
complexity  can  be  reduced  substantially.  In  addition,  we  recognize  that  embedded  in 
the MDP solution, there are a number of different action sequences by which a team’s
goals  can  be  realized.  That  is,  in  achieving  the  aforementioned  optimality  criterion,  we 
identify  alternate  sequences  for  accomplishing  the  primary  goals. 

1.  Introduction 
1.1.  Motivation 

Changing  patterns  of  today’s  world  impose  the  need  for  current  and  future  military  forces  to  conduct  a 
broader  and  more  complex  spectrum  of  operations.  Wars  no  longer  take  place  between  nation-states 
on  traditional  battlefields,  but  have  been  replaced  by  emergent  and  asymmetric  threats  involving 
cultural  factions  and  trans-national  players.  Decisions  must  be  made  in  real  time  with  simultaneous 
tactical,  operational,  and  strategic  implications.  In  response  to  these  demanding  requirements,  military 
forces  need  to  employ  new  operational  concepts  and  command  approaches.  This  calls  for  much 
greater  emphasis  on  realistic  modeling  of  dynamic  military  organizations  that  enable  them  to  evaluate 
their  current  strategies,  their  strengths  and  weaknesses,  and  explore  various  strategic  options  based  on 
current  knowledge  and  forecasts. 

A  team  management  mechanism,  which  ensures  the  cooperation  of  individuals  in  their  pursuit  of 
desired  organizational  goals,  involves  both  managing  the  team’s  intentions  and  organizing  its  activities 
to fulfill them. A desired system state describes the organization’s internal and external conditions. It
comprises many independent or loosely dependent dimensions of the system and its environment
(e.g., the set of desired goals, the resources available to pursue the goals, and the time available for
mission processing). Deliberate changes in state are brought about by functions (or sets of actions), which are
assigned  to  individuals  within  a  team.  That  is,  a  function  can  imply  a  specific  intent  to  change  the  state 
of  the  environment  or  an  activity  carried  out  by  an  individual(s)  to  perform  this  change.  Goal 
management  is  the  process  of  prioritizing  goals  consistently  for  optimized  team  collaboration.  The 
goal  management  problem  is  formulated  as  a  Markov  decision  problem,  in  which  the  objective  is  to 
determine  an  optimal  closed-loop  policy  that  maximizes  the  probability  of  reaching  the  final  desired 
system  state  under  resource  and  time  constraints. 

In  order  to  elucidate  the  points  made  throughout  the  paper,  let  us  consider  the  following  simplified 
scenario.  Assume  that  there  are  two  opposing  coalitions:  the  RED  and  the  BLUE  factions.  Political 
factors  (introduced  by  broad  coalitions  of  all  parties  involved),  the  geographic  spread  of  BLUE 
military  industrial  complex  (the  location  of  maintenance  facilities,  army  depots  and  logistics  pipelines), 
mixed  ethnicities,  cultures,  and  religious  backgrounds  among  the  parties  involved,  significantly  affect 
the course of operations. This situation produces complications with a direct bearing on the course of
military  activities.  With  enormous  humanitarian,  political,  and  economic  stakes,  the  BLUE  forces 
resolve  to  bring  peace  to  all  factions  involved. 

BLUE’s  political  and  military  experts  around  the  globe  agree  that  there  are  three  major  avenues  to 
resolving  the  conflict.  The  first  is  a  pure  political  solution,  wherein  all  parties  involved  would  resolve 
their  differences  through  negotiations  and  constructive  talks.  In  this  case,  the  involvement  of  military 
forces  would  be  restricted  to  minimizing  hostilities  in  the  region  of  conflict.  The  second  solution 
proposes an all-out war to remove RED’s forces from the region. To ensure a successful outcome, this
approach requires BLUE to have accurate assessments of the strengths and weaknesses of both
sides. To this end, reconnaissance missions need to be conducted to accurately estimate RED’s
forces. Based on the gathered information, strategies for an all-out war are drawn up. BLUE’s
experts also suggest combining the two former strategies to produce a comprehensive solution involving
both political and military approaches. Suppose BLUE’s military strategists are weighing these
three approaches and analyzing the possible outcomes. The tactical roadmap of options and
outcomes can be described by an AND/OR graph as in Figure 1. The task is to identify a strategy
that maximizes the probability of reaching a peace accord, given specified resource and time
constraints.


Figure  1.  The  AND/OR  Goal  Graph  of  BLUE  Coalition 
1.2.  Representation  of  the  Problem 

The  Markov  decision  process-based  strategy  roadmap  connecting  the  initial  system  state  to  the  final 
state,  via  a  set  of  intermediate  states,  is  represented  by  an  acyclic  graph.  The  nodes  of  the  graph 
denote  system  states.  The  arcs  denote  transition  probabilities  among  system  states,  which  depend  on 
functions  executed  at  the  state  and  on  the  amount  of  resource  and  time  available  at  this  state.  Each 
function requires a certain amount of resource to complete. Transition probabilities to move from one
state to another, given the control function applied at a state, are given. The objective is to find a
sequence of control functions (actions) that maximizes the probability of reaching the final goal state
under  resource  and  time  constraints. 

More formally, the approach can be formulated as follows. Let $S = [X, R, D]$ denote a state space that
includes the resource state space $R$ and the duration dimension $D$ (such that $D = [0, d]$, $d \in I$
(integer), and $R = [0, r]$, $r \in I$). The notation $X = \{1, 2, \ldots, n\}$ $^{4}$ denotes the state space whose elements
represent different ‘states of the system goals’. Formally, the system state is characterized by the
three-tuple $s = (i, y)$, where $i$ denotes the state of goals and $y = (r, d)$, with $r$ denoting the available
resources and $d$ denoting the available time.

Let $F(i, y) = \{f_m(i, y),\ m = 1, \ldots, M(i, y)\}$ be a set of control functions that can be applied when the
system state is $s = (i, y)$. The notation $d(f_m(i, y)) > 0$ defines the duration of applying a control
function $f_m(i, y)$ in state $i$, while $r(f_m(i, y)) > 0$ denotes the resource requirement for control function
$f_m(i, y)$. A function $f_m(i, y)$ transitions the goal state $i \in X_k$ to the goal state $j \in X_{k+1}$ in $d(f_m)$ time
units, while utilizing $r(f_m)$ resources.


Let the strategy roadmap denote a directed graph $\Omega(S, V)$ consisting of primary nodes $S = \{s\}$
representing system states and edges $V = \{v\}$. The edges denote transition probabilities (i.e.,
probabilities of reaching the intended system states when certain control functions are carried out). Let
$p_{ij}(f_m(i, y)) = P(x_{k+1} = j \mid x_k = i, y, f_m(i, y))$ denote the transition probability that the goal state at stage
$k + 1$ is $j$ given that the goal state at stage $k$ is $i$ and that a control function $f_m(i, y)$ is applied.
When function $f_m(i, y)$ is executed by the system, this execution takes $d(f_m(i, y))$ units of time and
expends $r(f_m(i, y))$ resource units. Therefore, when function $f_m(s_k)$ is applied at state $s_k = (i, y_k)$,
the system state probabilistically changes to $s_{k+1} = (j, y_{k+1})$, where
$y_{k+1} = [r_k - r(f_m(i, y_k)),\ d_k - d(f_m(i, y_k))]$. The probabilistic notions are introduced to accommodate
various unforeseen events, such as execution failures and random state shifts during the execution of
control functions. Based on the problem constraints, we generate a layered acyclic graph of system
states with $N$ layers. The objective is to maximize the probability of reaching the final goal states
$P(x_N = l,\ l \in X_N)$, subject to $y_0 = (r, d)$, $y_k \ge 0$, $\forall k = 1, \ldots, N$.

3 For elucidation purposes, we consider only one resource type. Extension to a vector of resources is straightforward.

4 The notation $x_k = i$ means that the system goal state is $i$ at stage $k$.
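As an illustration of the state update described above, the following Python sketch (not part of the original formulation) encodes a system state $s = (i, r, d)$ and a control function, and computes the distribution over successor states. The names `SystemState`, `ControlFunction`, and `successors`, as well as all numerical values, are hypothetical; feasibility is assumed to mean that the remaining resource and time stay non-negative.

```python
from typing import Dict, NamedTuple


class SystemState(NamedTuple):
    goal: int      # i, the state of the goals
    resource: int  # r, remaining resource units
    time: int      # d, remaining time units


class ControlFunction(NamedTuple):
    name: str
    resource_cost: int             # r(f) > 0
    duration: int                  # d(f) > 0
    transitions: Dict[int, float]  # successor goal state j -> p_ij(f)


def is_feasible(s: SystemState, f: ControlFunction) -> bool:
    # Assumption: f may be applied only if it does not overdraw r or d.
    return f.resource_cost <= s.resource and f.duration <= s.time


def successors(s: SystemState, f: ControlFunction) -> Dict[SystemState, float]:
    """Return {s_{k+1}: probability} after applying f in state s."""
    if not is_feasible(s, f):
        return {}
    r_next = s.resource - f.resource_cost  # r_k - r(f)
    d_next = s.time - f.duration           # d_k - d(f)
    return {SystemState(j, r_next, d_next): p for j, p in f.transitions.items()}
```

For instance, a hypothetical function costing 3 resource units and 4 time units, applied in state (1, 5, 6), leads only to successor states of the form (j, 2, 2).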
Table II. Markov State Representation of AND/OR Goal Graph


Figure 2. The Transitions from $x_k = 1$ to Successors $x_{k+1} = j$ Given $f_m(1, y)$
Let us examine the scenario presented in Figure 1. In Table II, we can see the transformation of the
AND/OR graph nodes in Figure 1 into the $X$ states of a strategy roadmap. The nodes no longer
denote the goal states exclusively; they become representations of the system states. For example,
$x_k = 6$ represents the system state in which goals $G_1$ and $G_3$ have been achieved ($G_0$ represents the
known initial system state). Furthermore, if the system is in state $x_k = 6$, it can transition to
$x_{k+1} = 6, 8, 13$, or $14$ by applying $f_2$, $f_5$, $f_7$, or $f_8$. That is, the strategy roadmap representation naturally
captures the options and restrictions of the system at each state. If the system is in $x_k = 6$, the function
$f_9$ may not be executed, for this requires $G_2$ to have been achieved. Moreover, being in $x_k = 6$
would make it impossible for the system (given its allowable options) to reach $x_{k+1} = 15$. In addition,
being in state $x_k = 6$ also makes state 5 infeasible. That is, after reaching $G_1$, it is not viable for the
system to abandon it. Although this assumption is somewhat restrictive, it is reasonable in
many situations. For example, if the goal is to destroy a target, then once the goal is
achieved, nothing the system does will undo it.

In general, execution failures and random state shifts during the execution of control functions may
actually transform the system to states other than the intended one. The MDP-based strategy roadmap
representation easily accommodates this case. For example, assume that the system is at $[6, r, d]$.
When function $f_8$ is applied (assuming that $r - r(f_8) \ge 0$ and $d - d(f_8) \ge 0$), due to various
internal and external events, the system may transition to $[6, r - r(f_8), d - d(f_8)]$,
$[8, r - r(f_8), d - d(f_8)]$, $[13, r - r(f_8), d - d(f_8)]$, or $[14, r - r(f_8), d - d(f_8)]$ (see Table IV in the
Appendix). Furthermore, in order to reduce the number of system states, we can combine all the
absorbing states into one. That is, noting that $x_k = 11, 12, 13, 14, 15, 16$ all represent terminal
states, we can simply label them as $x_k = 11$.
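The lumping of absorbing states can be sketched in a few lines of Python. This is only an illustration: the function name `lump_absorbing` and the probabilities in the usage note are hypothetical, and the sketch simply assumes that the probability mass of all terminal successors is summed into the single label.

```python
from typing import Dict, Set


def lump_absorbing(transitions: Dict[int, float],
                   absorbing: Set[int],
                   label: int) -> Dict[int, float]:
    """Merge the probability mass of all absorbing successors into one state."""
    lumped: Dict[int, float] = {}
    for j, p in transitions.items():
        key = label if j in absorbing else j  # relabel terminal states
        lumped[key] = lumped.get(key, 0.0) + p
    return lumped
```

With terminal states 11 through 16 relabeled as 11, a hypothetical row `{6: 0.2, 8: 0.3, 13: 0.25, 14: 0.25}` becomes `{6: 0.2, 8: 0.3, 11: 0.5}`.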

1.3.  Contributions  and  Earlier  Work 

The  goal  management  problem  belongs  to  the  class  of  planning  problems  under  uncertainty  (which  is 
typically  termed  decision-theoretic  planning  (DTP)).  We  adopt  a  Markov  decision  process  (MDP) 
framework  as  an  underlying  model  for  the  problem. 

Formulation of planning problems, and in particular probabilistic planning problems, as graph search
problems (specifically as MDPs) has attracted a surge of interest in recent years [11], [12]. See also [5],
[6], [13], and [15]. An MDP is a suitable representation of goal management problems, since the
model  takes  into  consideration  problem  uncertainties,  which  include  uncertain  effects  of  actions, 
incomplete  information  about  the  environment,  and  uncertainties  in  the  goal  states.  Moreover,  it 
models  our  problem  accurately,  since  the  objective  of  an  MDP  is  to  devise  courses  of  actions  (plans  or 
policies)  with  a  high  probability  of  success,  in  contrast  to  an  assured  attainment  of  intended  goals  as  in 
a  traditional  deterministic  planning  problem. 

Although  the  representation  of  a  DTP  as  an  MDP  is  not  new,  the  explicit  connection  between  the 
traditional  planning  routines  (in  particular  AND/OR  graph  representations,  which  exclude 
uncertainties)  and  the  MDP  is  novel.  Adopting  the  MDP  framework  as  a  model  for  formulating  and 
solving  goal  management  problems  has  its  advantages  and  shortcomings.  Goal  management  problems 
typically  exhibit  considerable  structure  in  value  functions  describing  the  performance  criteria,  in 
functions  depicting  state  transitions,  and  in  relationships  among  features  used  to  describe  states  and 
actions. This structure permits the use of special-purpose methods that recognize and exploit it,
thereby allowing the problem to be solved with less computational effort than other methods. Specifically, the
MDP representation takes into account the problem domain specifics and uses them to its advantage,
viz., reducing the cardinality of the state space. In the given scenario, the cardinality of the system
states is $|S| = 11$ instead of $2^5$ (the example encompasses 5 goals); the cardinalities of the control
function sets $|F|$ are between 1 and 7, instead of $2^4$ (the example has 4 possible actions). See Table I
and Table II.

The  general  impediment  [6]  to  the  more  widespread  acceptance  of  MDPs  as  a  general  model  of 
planning  is  the  computational  adequacy  of  MDPs  as  a  planning  model:  can  the  techniques  scale  to 
solve planning problems of reasonable size? One difficulty with the solution techniques for MDPs is
the tendency to rely on explicit, state-based problem formulations. This can be problematic, since the state
space grows exponentially with the number of problem features (e.g., multiple types of resources, a large
number  of  control  actions,  etc.).  This  paper  seeks  to  alleviate  the  computational  burden  in  the  MDP- 
based  approach  by  introducing  problem  domain-based  heuristics  into  the  search  algorithms. 

1.4.  Organization  of  the  Paper 

We  formulate  the  goal  management  problem  as  a  finite-state  Markov  decision  problem.  This  approach 
opens  doors  to  utilizing  a  broad  field  of  mature  graph  search  techniques.  In  particular,  we  explore 
various  optimal  and  sub-optimal  heuristic  search  approaches  and  contrast  the  solutions  (for  small 
problems)  to  those  of  dynamic  programming  (DP)  recursions. 

The  paper  is  organized  as  follows.  Section  2  explores  various  graph  search  techniques,  which  include 
the classical DP technique, the AO* heuristic search algorithm, and greedy heuristics. Novel problem
domain-based heuristic evaluation functions (HEFs) are introduced and evidence of their admissibility
is presented. In
Section 3, we compare the various algorithms on an illustrative example. Finally, we conclude with a
summary  of  findings  and  future  work. 

2.  Solution  Approaches 

The paper seeks to alleviate the computational burden inherent in MDP-based approaches through the
use of problem domain-based heuristics. In particular, the paper explores the optimal AO* algorithm
and several suboptimal greedy heuristics. The (computational) costs and performance of the
aforementioned approaches are then contrasted with those of the standard DP recursion.

2.1.  Dynamic  Programming  (DP)  Recursion 

DP is a common technique used in situations where decisions can be made in stages and where,
despite the unpredictable nature of the decision outcomes, the desired outcome can be quantified in
advance. The idea is to quantify the desired outcome in a mathematical expression (typically referred
to as the objective function), and the goal is to maximize (or minimize, depending on the nature of the
problem) it. An important aspect of the problem is that the decision made at each stage is based
on the present value function and the expected future value [1], [2]. We are interested in knowing
which set of control functions $\{f_m(i, y)\}$ to apply at each stage such that it results in the maximal
probability of reaching the final goal.
We represent the next goal state $x_{k+1} = j$ as a function of the current goal state $x_k = i$ and the control
function being applied, $f_m(i, y)$:

$j = g(i, f_m(i, y))$ (1)

Furthermore, we consider control laws that consist of a sequence of admissible functions
$\pi = \{f^{(0)}, f^{(1)}, \ldots, f^{(N-1)}\}$ $^{5}$, defined by $f^{(k)} = \{\mu_k(s_k),\ k = 0, \ldots, N-1\}$. The notation $\mu_k(\cdot)$ signifies
some function of the argument. Given an initial state $s_0 = (x_0, y_0)$ and an admissible policy
$\pi = \{f^{(0)}, f^{(1)}, \ldots, f^{(N-1)}\}$, the objective function is given by

$J_\pi(x_0, y_0) = P(x_N = l,\ l \in X_N)$ (2)

where $X_N$ denotes the set of absorbing states (intended goal states). Recall that when the absorbing
states are lumped into one state, we set $X_N = \{n\}$ so that $|X_N| = 1$. Thus, an optimal policy $\pi^*$ is one
that maximizes the following objective function:

$J_{\pi^*}(x_0, y_0) = \max_{\pi} J_\pi(x_0, y_0)$ (3)

The DP technique is based on the principle of optimality; see [1] and [2]. In words, it states that if
there exists an optimal policy $\pi_0^* = \{f^{(0)*}, f^{(1)*}, \ldots, f^{(N-1)*}\}$, then the truncated policy
$\pi_k^* = \{f^{(k)*}, f^{(k+1)*}, \ldots, f^{(N-1)*}\}$ is also optimal. The DP decomposes the problem into a sequence of
optimizations carried out over the set of control functions. The DP algorithm starts with $J_N(x_N, y_N)$ and
proceeds backward in stages from stage $N - 1$ to $0$. We let $J_k(i, r, d)$ be the cost-to-go (value-to-go
in our case) function at stage $k$. From the principle of optimality, the dynamic programming (DP)
algorithm can be written as

$J_k^*(i, r, d) = \max_{f \in F(i, y),\ r(f) \le r,\ d(f) \le d} \sum_j p_{ij}(f)\, J_{k+1}^*(j, r - r(f), d - d(f)), \quad k = N-1, \ldots, 0$ (4)

with a terminal condition

$J_N^*(n, r, d) = 1, \quad J_N^*(i, r, d) = 0, \quad \forall i \ne n,\ r \ge 0,\ d \ge 0$ (5)

However, the DP algorithm has computational requirements of $O((|X| M)^Q)$, where $M = \max\{|F|\}$
denotes the largest cardinality of the control function sets, $|X|$ represents the cardinality of the goal
states, and $Q = \min\{r / \max_f r(f),\ d / \max_f d(f)\}$. This can be quite substantial for large
$M$, $|X|$, and $r$ (or $d$).
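A minimal Python sketch of the recursion (4) with terminal condition (5) is given below, written top-down with memoization rather than as an explicit backward sweep. The toy problem (three goal states, with state 3 as the lumped absorbing state $n$) and the names `FUNCS` and `value_to_go` are hypothetical. The stage index $k$ is left implicit, which is valid here because every function consumes positive resource and time, so $(r, d)$ strictly decreases along any path; feasibility is assumed to be non-strict.

```python
from functools import lru_cache

N_GOAL = 3  # hypothetical lumped absorbing goal state n

# FUNCS[i] lists the control functions available in goal state i as
# (r(f), d(f), {j: p_ij(f)}); all numbers are illustrative only.
FUNCS = {
    1: [(1, 1, {1: 0.5, 2: 0.5}),
        (2, 1, {1: 0.2, 3: 0.8})],
    2: [(1, 1, {2: 0.1, 3: 0.9})],
}


@lru_cache(maxsize=None)
def value_to_go(i: int, r: int, d: int) -> float:
    """Maximal probability of reaching N_GOAL from (i, r, d)."""
    if i == N_GOAL:
        return 1.0  # terminal condition for the goal state
    best = 0.0      # value when no feasible function remains
    for r_f, d_f, trans in FUNCS.get(i, []):
        if r_f <= r and d_f <= d:  # resource and time constraints
            q = sum(p * value_to_go(j, r - r_f, d - d_f)
                    for j, p in trans.items())
            best = max(best, q)
    return best
```

In this toy instance, `value_to_go(1, 2, 2)` compares a two-step route through state 2 (value 0.45) with a direct but more expensive function (value 0.8) and returns the latter.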


5 Note that the superscripted $f^{(k)} = \{f(s) : f(s) \in F(s),\ s \in S_k\}$ denotes a set of control functions applied at the $k$-th
stage, where $S_k$ signifies the set of system states at stage $k$, whereas the subscripted $f_m$ signifies the $m$-th control function.


2.2.  Heuristic  Approaches 

Another  approach  for  planning  under  uncertainty  is  based  on  state-based  graph  search  [6].  These 
techniques  employ  estimates  of  value-to-go  functions  in  (4),  called  heuristic  evaluation  functions 
(HEFs),  to  overcome  the  computational  explosion  of  the  DP  recursion.  In  the  following,  we  consider 
two  such  HEFs,  as  well  as  greedy  heuristics  that  employ  local  step-by-step  optimization. 

2.2.1.  Heuristic  Evaluation  Functions  (HEFs) 

We  formulate  the  problem  of  maximizing  the  conditional  probability  of  reaching  a  final  goal  state 
under  resource  and  time  constraints  as  an  informed  best-first  search  on  an  MDP-based  graph,  wherein 
the  approximate  value-to-go  (HEF)  is  derived  from  the  current  conditional  probability  values  of  a 
partially  developed  tree  and  the  expected  depth  of  the  tree  (the  residual  layers,  based  on  the  remaining 
resources and time). The HEF allows for the use of a top-down search algorithm, such as AO* [16]. In
the  following,  we  develop  two  HEFs. 

2.2.1.1. HEF 1: $h_1(s)$

The first HEF follows directly from the DP recursion (4) and the fact that the optimal value-to-go is
bounded by 1:

$J_k^*(i, r, d) \le \max_{f \in F(i, y),\ r(f) \le r,\ d(f) \le d} \sum_j p_{ij}(f), \quad k = 0, \ldots, N-1$ (6)

The upper bound depends on the value of the transition probability
$p_{ij}(f_m(i, y)) = P(x_{k+1} = j \mid x_k = i, y, f_m(i, y))$. This bound determines our first HEF:

$h_1(s) = \max_{f \in F(i),\ r(f) \le r,\ d(f) \le d} \sum_j p_{ij}(f), \quad \forall x_k = i,\ \forall k = 0, \ldots, N-1$ (7)

2.2.1.2. HEF 2: $h_2(s)$

In addition to $p_{ij}$, $J_k^*(i, r, d)$ depends on the number of remaining layers, $N_{\min}$, at stage $k$ (before
the system exhausts either $r$ or $d$). Recall that the problem constraints restrict the number of stages,
$N$, in the MDP graph. The minimum number of remaining stages is determined as follows:

$N_{\min} = \min\left\{ \max_{f \in F(i)} \mathrm{ceil}\left( \frac{r_{\min}}{r(f)} \right),\ \max_{f \in F(i)} \mathrm{ceil}\left( \frac{d_{\min}}{d(f)} \right) \right\}$ (8)

The minimum resource requirement $r_{\min}$ is the smallest amount of resources the system needs to
expend to reach the desired state from the current state, whereas $d_{\min}$ is the shortest time required for
the system to reach the desired state from the current state. Note that the operator $\mathrm{ceil}(\cdot)$ rounds the
number up to the next integer value. Using (8), we define the second HEF:

$h_2(s) = \left( \max_{f \in F(i),\ r(f) \le r,\ d(f) \le d} \sum_j p_{ij}(f) \right)^{N_{\min}}$ (9)
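The two HEFs can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: the names `h1` and `h2` and all numbers are hypothetical, the number of remaining layers `n_min` is simply passed in, and each function is assumed to carry only those transition probabilities that are counted in the sum over $j$.

```python
from typing import List, Sequence, Tuple

# One control function: (r(f), d(f), probabilities p_ij(f) counted in the sum).
Func = Tuple[int, int, Sequence[float]]


def h1(funcs: List[Func], r: int, d: int) -> float:
    """First HEF: best summed transition probability over feasible functions."""
    sums = [sum(probs) for r_f, d_f, probs in funcs
            if r_f <= r and d_f <= d]  # assumed non-strict feasibility
    return max(sums, default=0.0)     # 0 when nothing is feasible


def h2(funcs: List[Func], r: int, d: int, n_min: int) -> float:
    """Second HEF: the first HEF raised to the number of remaining layers."""
    return h1(funcs, r, d) ** n_min
```

Since the base lies in [0, 1], `h2` never exceeds `h1` for `n_min >= 1`, so the second HEF is the tighter (and therefore the riskier) estimate of the two.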


2.2.2.  Admissibility  of  the  HEFs 

In this subsection, we seek to prove that the selected HEFs have the property of admissibility [16],
[17]. When the HEFs are admissible, the objective function approaches the optimal objective function
value from above (in the case of maximization).

Definition 1: Let $\Omega(S)$ be an MDP-based strategy graph. An HEF $h(s)$ defined on $\Omega(S)$ is
admissible if for each node $s \in S$ in $\Omega(S)$, $h^*(s) \le h(s)$, where $h^*(s)$ is the optimal value-to-go. Moreover, this $h(s)$
is always finite with an upper bound of 1.

The admissibility of the first HEF follows directly from (4):

$h^*(s) = J_k^*(i, r, d) \le \max_{f \in F(i, y),\ r(f) \le r,\ d(f) \le d} \sum_j p_{ij}(f) = h_1(s), \quad k = 0, \ldots, N-1$ (10)


In the same vein, the analysis for the admissibility of the second HEF follows from:

$h^*(s) = J_k^*(i, r, d) \le \left( \max_{f_k \in F(s_k),\ r(f) \le r,\ d(f) \le d} \sum_{x_{k+1}} P(x_{k+1} \mid x_k, f_k) \right)^{N_{\min}} = h_2(s), \quad \forall x_k = i,\ \forall k = 0, \ldots, N-1$ (11)


Unfortunately, the second HEF does not always result in an admissible heuristic. When
$\exists\, x_{k+1} \ne j : \max P(x_{k+1} \mid x_k, f_k) \gg \max \sum P(j \mid x_k, f_k)$ and $N_{\min} \approx N - k$, the second HEF is
inadmissible. Additionally, when the cardinality $|S|$ is large, which means that the value of
$\max_{f \in F(i)} p_{ij}(f)$ is small (recall that $\sum_j p_{ij}(f) = 1$), the second HEF is also potentially inadmissible.
Otherwise, the second HEF is in general admissible.
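Definition 1 can be checked numerically on small problems where the optimal value-to-go is available (for example, from the DP recursion). The following Python sketch is illustrative only; the function name `is_admissible`, the tolerance, and the values in the usage note are assumptions.

```python
from typing import Dict, Hashable


def is_admissible(h: Dict[Hashable, float],
                  h_star: Dict[Hashable, float],
                  tol: float = 1e-12) -> bool:
    """True if h*(s) <= h(s) <= 1 holds at every node s (maximization case)."""
    return all(h_star[s] - tol <= h[s] <= 1.0 + tol for s in h_star)
```

With hypothetical optimal values `{"s0": 0.8, "s1": 0.9}`, the estimate `{"s0": 0.9, "s1": 1.0}` passes the check, whereas `{"s0": 0.64, "s1": 1.0}` fails because it underestimates the value at `s0`.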


2.2.3. AO* Algorithm


AO* is a best-first search algorithm [16], [17], which expands only the nodes with the most promising
chance of reaching the goal nodes on the basis of the HEF. The algorithm repeats three steps.
First, a top-down graph traversing operation follows the best current path and accumulates
the set of nodes that are on the path and not yet expanded. Second, the procedure selects an
unexpanded node and expands it: the HEF values of all the successors of the expanded node are computed,
and these nodes are added to the graph. Third, a bottom-up value revising operation changes the HEF of the
expanded node and propagates this change back to the initial node according to the DP recursion in (4).
Following [16], there are two important attributes of this heuristic search strategy applicable here.
First, if the HEF is admissible, AO* guarantees an optimal solution. Second, the search efficiency
depends critically on the degree to which the HEF approximates the optimal value-to-go.

In this paper, we use the following conventions. We label nodes that are expanded as CLOSED.
Nodes that are generated, but not yet expanded, are termed OPEN. These two sets are maintained
throughout the search process.

Algorithm AO*:

Step 1: Initially let the search graph $\Omega(S)$ consist of the start node (that is, the initial goal, the initial
available resources, and allowable duration) $s_0 = (x_0, y_0)$, $y_0 = (r, d)$. Set $G(s_0) = h(s_0)$. If $s_0$
is a terminal node, then label $s_0$ CLOSED and exit with the solution; otherwise, label it
OPEN.

Step 2: Repeat the following steps until $s_0$ is labeled CLOSED. Then exit with $J = G(s_0)$ as the expected
value and the marked solution tree as the function strategy.

Step 2.1: Compute a partial solution graph $\Omega'$ in $\Omega(S)$ by tracing down the marked arcs in $\Omega(S)$
from the root node $s_0$. Select for expansion a node $s_i$ of $\Omega'$ that has the smallest $h(s_i)$
(initially $s_i = s_0$).

Step 2.2: Generate in $\Omega(S)$ all successors of $s_i = [i, r, d]$ spanned by the allowable
functions $f \in F(s_i)$: $s_j = [j, r - r(f), d - d(f)]$. Label all $s_j$ as OPEN. For each
immediate successor $s_j$ of $s_i$ not already present in $\Omega(S)$, set $G(s_j) = h(s_j)$. If any $s_j$ is
a terminal leaf node, label $s_j$ CLOSED.

Step 2.3: Create a temporary set $Z$ of nodes consisting only of node $s_i$.

Step 2.4: Repeat the following steps until $Z$ is empty.

Step 2.4.1: Remove from $Z$ a node $s_i$ such that no successor of $s_i$ in $\Omega'$ occurs in $Z$.

Step 2.4.2: Revise the value of the HEF of $s_i$ as follows: $e = \max_{f_m \in F(s_i)} \{ \sum_j p_{ij}\, G(s_j) \}$. Let $k$ be
the index of the function for which the maximum occurs. Resolve ties
arbitrarily, but give preference to CLOSED nodes. Mark all arcs spanned
by $f_k$ (both in $\Omega'$ and $\Omega(S)$). Label $s_i$ CLOSED if all of its successors
are labeled CLOSED.

Step 2.4.3: If $G(s_i) \ne e$, then set $G(s_i) = e$.

Step 2.4.4: If $G(s_i)$ changes its value in Step 2.4.3 or if $s_i$ is labeled CLOSED, then
add to $Z$ $s_i$ and all of the ancestors of $s_i$ along the marked path. Ignore
ancestors of $s_i$ not connected to $s_i$ by marked arcs.
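The bottom-up value revision of Steps 2.4.2 and 2.4.3 can be sketched in Python as follows. This covers a single revision step only, not the full AO* search; the names `revise` and `update`, the node labels, and the probabilities are hypothetical, and ties are broken by iteration order rather than by preferring CLOSED nodes.

```python
from typing import Dict, Optional, Tuple

Options = Dict[str, Dict[str, float]]  # function name -> {successor: p_ij}


def revise(G: Dict[str, float], options: Options) -> Tuple[float, Optional[str]]:
    """Step 2.4.2: e = max over functions of sum_j p_ij * G(s_j)."""
    best_f, e = None, float("-inf")
    for name, trans in options.items():
        q = sum(p * G[s_j] for s_j, p in trans.items())
        if q > e:  # first maximizer wins; ties are resolved arbitrarily
            best_f, e = name, q
    return e, best_f


def update(G: Dict[str, float], node: str, options: Options) -> bool:
    """Step 2.4.3: overwrite G(node) if the revised value differs."""
    e, _ = revise(G, options)
    changed = G.get(node) != e
    G[node] = e
    return changed  # a change triggers propagation to marked ancestors
```

The function name returned by `revise` corresponds to the arcs that would be marked, and the boolean returned by `update` indicates whether the ancestors of the node need to be revisited as in Step 2.4.4.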


2.2.4.  Greedy  Heuristics 

The term greedy heuristic refers to the notion that all of the approximation techniques in this category
employ a local, step-by-step optimization. The optimization techniques are typically in the form of a
$k$-step look-ahead procedure. In this subsection, we consider two such techniques, with $k = 1$.




2.2.4.1. Greedy Heuristic 1 (GH1)


To select the best control function for a given state, the system needs to estimate the probability of
reaching the intended goal state $n$ by executing each available function. Let $X_{k+1}(f)$ be the set of all
direct successors of the current goal state $x_k$, transformed by control function $f \in F(x_k)$, and let
$|X_{k+1}(f)|$ be the cardinality of the direct successor space. In our example, if the current state is
$x_k = 1$, then $|X_{k+1}(f_1)| = 2$, $|X_{k+1}(f_5)| = 4$, and so on. The first greedy heuristic (GH1) works on the
premise that a function with a larger $|X_{k+1}(f)|$ has a better chance of bringing the system to $x_N = n$
faster. If $U(f)$ is the heuristic value for executing control function $f \in F(x_k)$, then the next control
function $f_k$ is selected to maximize $U(f)$:


$f_k = \arg\max_{f \in F(x_k)} U(f), \qquad U(f) = \begin{cases} \sum_{f \in F(x_k)} |X_{k+1}(f : x_{k+1} > 0)| & \ldots \\ |X_{k+1}(f : x_{k+1} > 0)| & \text{else} \end{cases}$ (12)
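The premise behind GH1 can be illustrated with a deliberately simplified Python sketch that ranks the functions by the number of their direct successors. It does not reproduce the full heuristic value $U(f)$ of (12); the name `gh1_select`, the successor sets, the costs, and the feasibility filter on resources and time are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# function name -> (r(f), d(f), set of direct successor goal states)
FuncTable = Dict[str, Tuple[int, int, Set[int]]]


def gh1_select(funcs: FuncTable, r: int, d: int) -> Optional[str]:
    """Pick the feasible function with the most direct successors."""
    feasible = {name: succ for name, (r_f, d_f, succ) in funcs.items()
                if r_f <= r and d_f <= d}
    if not feasible:
        return None  # no function can be executed with the remaining r and d
    # max() keeps the first maximizer, so ties are resolved by insertion order.
    return max(feasible, key=lambda name: len(feasible[name]))
```

For a hypothetical table in which `f1` has two successors and `f5` has four, the sketch selects `f5` whenever it is affordable and falls back to `f1` otherwise.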


2.2.4.2.  Greedy  Heuristic  2  (GH2) 


In  general,  greedy  search  GH1  tends  to  seize  immediate  reward  at  the  expense  of  long-term  gain.  To 
alleviate  this  propensity,  we  consider  a  heuristic  that  estimates  the  value  of  each  control  function, 
accounting for future values. In this vein, at each state, the next control function $f_k$ is chosen as

$f_k = \arg\max_{f \in F(i),\ r(f) \le r,\ d(f) \le d} \sum_j P(x_{k+1} = j \mid x_k = i, f)$ (13)


The latter heuristic is the heuristic evaluation function discussed earlier. GH2 has a computational
complexity of $O(M |X| Q)$, where $M$, $|X|$, and $Q$ are as previously defined. The computational
reduction compared to that of DP, $O((|X| M)^Q)$, is substantial and is due to the fact that the greedy
heuristic unequivocally selects a function to execute at each stage (ties are resolved arbitrarily).

3.  Algorithm  Evaluation 

3.1.  Construction  of  Transition  Probabilities 

Recall that $p_{ij}(f_m(i, y)) = P(x_{k+1} = j \mid x_k = i, y, f_m(i, y))$ denotes the transition probability that the next
goal state is $j$ given that the current goal state is $i$ and that a control function $f_m(i, y)$ is applied.
Therefore, the transition probability depends on the current goal state, the successor goal state, and the
function applied (which also depends on the current state). Generally, $p_{ij}(f_m(i, y))$ values can be
inferred from historical data. For illustrative purposes, however, we generate the transition
probabilities as follows.

The prior probability $P(x_{k+1} = j \mid x_k = i)$ is viewed as a random variable, which is governed by various
unforeseen events in the system, such that $\sum_j P(x_{k+1} = j \mid x_k = i) = 1$. Furthermore, we can reasonably
assume that, in the absence of any unforeseen events, the amount of resources and time committed
to transform the system state to the next state, as well as the distance between the two states, determine
the probability of transition to the intended state, $g_j(r(f_m), d(f_m), \|j - i\|^{-1})$. Based on Bayes' rule
and the total probability theorem [14], we assume the following relation for transition probabilities:


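$$p_{ij}(f_m(i, y)) = \frac{g_j\big(r(f_m), d(f_m), \|j - i\|^{-1}\big)\, P(x_{k+1} = j \mid x_k = i)}{\sum_{l} g_l\big(r(f_m), d(f_m), \|l - i\|^{-1}\big)\, P(x_{k+1} = l \mid x_k = i)} \tag{14}$$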


where $\|\cdot\|$ denotes a valid norm of its argument. The transition probabilities for the given example are
generated using (14) and are listed in Table III of the Appendix.
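
The construction can be illustrated with a short Python sketch. It assumes that (14) has the usual Bayes-rule form, in which a likelihood term is multiplied by the prior and normalized over all successor states. The particular likelihood `g` below is hypothetical (the paper does not specify its functional form here), as are the prior values and the helper name `transition_row`.

```python
import math


def g(r_f, d_f, j, i):
    """Hypothetical likelihood of reaching state j from state i.

    Assumed shape only: it decays with the distance |j - i| and decays more
    slowly when more resources and time (r_f + d_f > 0) are committed.
    """
    return math.exp(-abs(j - i) / (r_f + d_f))


def transition_row(i, r_f, d_f, prior):
    """Combine the likelihood with the prior and normalize over successors."""
    weights = {j: g(r_f, d_f, j, i) * p for j, p in prior.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


# Hypothetical prior P(x_{k+1} = j | x_k = 1) over three successor states.
prior = {1: 0.5, 2: 0.3, 3: 0.2}

cheap = transition_row(i=1, r_f=1, d_f=1, prior=prior)
costly = transition_row(i=1, r_f=3, d_f=4, prior=prior)
print(cheap)
print(costly)
```

With this choice of `g`, committing more resources and time shifts probability mass toward more distant states, while each row still sums to one.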

3.2.  Computational  Experiments 

Consider the scenario introduced in Section 1, where the desired goal state $x_N = 11$ is to be achieved on
or before a deadline of $d = 6$ time units using at most $r = 5$ units of resource. It is assumed
that the initial goal state is $x_0 = 1$, so that $y_0 = (1, 5, 6)$, as shown in Figure 3. The available control
functions $f_m(i, y)$ and their reachable successor states for each goal state are listed in Table IV of the
Appendix.

The DP-based optimal decision tree, highlighted in Figure 3, can be interpreted as follows. In the
initial state (1,5,6), the possible actions are $\{f_1, f_2, f_3, f_5, f_6, f_7\}$. At this stage, the best control
function is either $f_1$ or $f_2$. If the system chooses $f_1$, and if the next state (which could be either (1,2,2)
or (2,2,2)) is (1,2,2), then the organization should stop, because any option it chooses ($f_2$ or $f_3$) will
not lead to the desired state. However, if the next state is (2,2,2), the best control action is $f_2$. On the
other hand, if the system selects $f_2$ at the initial state, and if the next state is (3,4,5), the best control
action is $f_5$, and so on.
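
The way such a decision tree arises can be sketched with a small dynamic program over the augmented state $(x, r, d)$. The model below is a hypothetical three-state toy problem, not the paper's eleven-state example, and the names `MODEL`, `GOAL`, and `value` are ours. The recursion maximizes the probability of reaching the goal and returns `None` as the control when stopping is the only sensible option.

```python
from functools import lru_cache

GOAL = 3  # hypothetical desired goal state

# Hypothetical toy model: state -> {control: (r_cost, d_cost, {successor: prob})}
MODEL = {
    1: {"f1": (1, 1, {1: 0.5, 2: 0.5}),
        "f2": (2, 1, {1: 0.4, 3: 0.6})},
    2: {"f3": (1, 1, {2: 0.2, 3: 0.8})},
}


@lru_cache(maxsize=None)
def value(x, r, d):
    """Return (max probability of reaching GOAL, best control) from (x, r, d).

    A best control of None means "stop": either the goal is reached or no
    feasible control has a positive chance of reaching it.
    """
    if x == GOAL:
        return 1.0, None
    best_val, best_f = 0.0, None
    for name, (rc, dc, trans) in MODEL.get(x, {}).items():
        if rc > r or dc > d:
            continue  # not enough resources or time left
        val = sum(p * value(j, r - rc, d - dc)[0] for j, p in trans.items())
        if val > best_val:
            best_val, best_f = val, name
    return best_val, best_f


print(value(1, 2, 2))  # tight budget
print(value(1, 3, 2))  # one extra resource unit
print(value(1, 1, 1))  # too little left: stop
```

Every control consumes at least one time unit, so the recursion terminates; memoization keeps each augmented state from being evaluated more than once.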

Next, the AO* algorithm with HEF $h_1(s)$ was applied to this example. The resulting decision tree is
shown in Figure 4. Note that at $k = 0$, there are six feasible actions $\{f_1, f_2, f_3, f_5, f_6, f_7\}$. Among
these, $f_2$ has the highest revised expected value $\sum_j p_{ij} h(y_j)$ at the initial node. In
particular, $\sum_j p_{ij} h(y_j)$ for $f_2$ is $0.243(0.327) + 0.141(0.670) = 0.174$. Consequently, at this stage, the
algorithm chooses $f_2$ as the control function. The partial tree through function $f_2$ is then expanded.
The search proceeds further as follows. First, the partial tree through function $f_2$ is traced to its terminal
nodes. The goal state (1,4,5) has the smallest HEF, and therefore it is expanded first. It is determined
that at this point $f_2$ yields the highest revised expected value at (1,5,6). At the other possible goal
state (3,4,5), function $f_5$ is deemed the best choice. At each of the new nodes, the previous step is
repeated. The final revised value is 0.022, the optimal value previously obtained via the DP solution.

The graphs at each cycle of AO* are shown in Figure 4, and the optimal decision tree is revealed at
the last cycle. In this example, in the initial state (1,5,6), the best control function is $f_2$. If the next
state is (1,4,5), the best control action is $f_2$, and so on.


Figure 3. State Transitions with Selected Sets of Policies via the DP Algorithm


Embedded in the optimal decision tree are a number of different sequences in which a team's goals can
be realized. For instance, the DP identifies all four of the best option sequences, $(f_2, f_5)$, $(f_2, f_2, f_5)$,
$(f_1, f_2)$, and $(\ldots, f_2)$, that the system can take to arrive at the same expected success probability.
AO* recognizes the first two. Moreover, at any system state, the organization can opt to choose a
control function other than the best, and be able to predict the consequences of the selected strategy. If
necessary, all of the algorithms can easily carry the second-best (or even third-best, fourth-best, etc.) control
function at each state in all stages with little additional computational cost and a slightly increased storage cost.

Figure 4. State Transitions with Selected Sets of Policies via the AO* Search Procedure with $h_1(s)$


3.3. Advantages, Shortcomings, and the Alternatives for the AO* Algorithm

Even for small-size problems, one can appreciate the advantage of AO* over DP in alleviating the
computational explosion by avoiding the exploration of the entire set of solution trees. In this
example, the AO* search generates only 19 nodes, as opposed to the DP, which would have required 78
nodes (with $M = 7$, $|X| = 11$, and $Q = 1$, DP requires $O(77)$ nodes). The number of backtracks for
AO* with HEF $h_1$ is 0 for this example. In general, AO* with an admissible and tight HEF (one that closely
bounds the value-to-go) significantly reduces the computational burden of the DP, while still providing
the optimal solution tree.
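
To get a feel for how the two complexity expressions quoted in this paper, $O((|X|M)^Q)$ for DP and $O(M|X|Q)$ for GH2, diverge as $Q$ grows, the following Python sketch evaluates both for the example values $M = 7$ and $|X| = 11$. These are order-of-growth expressions rather than exact node counts, and the function names are ours.

```python
def dp_order(M, X, Q):
    """Order-of-growth expression for DP: (|X| * M) ** Q."""
    return (X * M) ** Q


def gh2_order(M, X, Q):
    """Order-of-growth expression for GH2: M * |X| * Q."""
    return M * X * Q


# Example values from the text: M = 7 control functions, |X| = 11 goal states.
for Q in (1, 2, 3):
    print(Q, dp_order(7, 11, Q), gh2_order(7, 11, Q))
```

For $Q = 1$ the two expressions coincide at 77, while for larger $Q$ the DP expression grows exponentially and the greedy one only linearly.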

As $Q$ increases (that is, as the number of $(r, d)$ pairs increases), however, the computational burden of backtracking in
AO* becomes considerable. This is one of the inherent shortcomings of the approach.


Figure  5.  Functional  Strategies  Obtained  via  GH1  and  GH2 

The first greedy heuristic (GH1) tends to seize immediate rewards at the expense of long-term gain,
and subsequently suffers from the consequences; see Figure 5. Fortunately, this is not always the case.
As will be shown in the next subsection, this approach results in acceptable control
strategies. The second greedy heuristic (GH2) alleviates the propensity to seize immediate rewards
by using HEFs to account for future values. The greedy search chooses the node with the largest HEF at
each step. It differs from AO* in that it performs a limited search with no backtracking. In
our example, GH2 results in the same solution as that of DP and AO*; see Figure 5. GH2 typically
results in near-optimal control strategies.

3.4.  Optimal  Choices  of  Resources  and  Duration  Lengths 

Figure 6 illustrates the effects of $r$ and $d$ on the probability of success $P(x_N = 11)$. The shape of the
surface plot indicates that, in order to achieve a higher probability of success, the system needs to commit
higher $r$ and $d$. The flat lines in the contour plot, however, indicate that for each value of $r$
(or $d$), there is only a limited set of options for the system in terms of the other variable. This result is
illustrated further in the following figures.

[Contour plot for various $(r, d)$ pairs; horizontal axis: $r$ (amount of resources)]

Figure 6. Surface and Contour Plots of $P(x_N = n)$ for Various $(r, d)$ Pairs via GH2

From the plots in Figure 7 and Figure 8, one can readily observe the following trends. The
probability of reaching the desired goal state, $P(x_N = 11)$, increases as $r$ and $d$ increase. As $d$
steadily increases from 1 to 40 time units, the increase in $P(x_N = 11)$ is moderated by the availability
of resource units, $r$. In particular, for GH2, if the system has $r = 25$ resource units, the system need
not commit beyond $d = 29$ time units because the probability of success saturates at 0.72. On the
other hand, if the system has $r = 50$ resource units, the system can achieve an even higher $P(x_N = 11)$ of
0.93 by increasing $d$ to about 40 time units; see Figure 8.


Figure 7. $P(x_N = n)$ as a Function of Duration Lengths for Various Resource Units for AO*

One can also observe the degree of suboptimality of the greedy heuristic strategies compared to those
of AO*. The $P(x_N = 11)$ of the AO* functional strategies reaches 0.46 when $d = 15$ time units for
$r = 16$ units, whereas GH2 never attains 0.46 for the same value of $r$; it saturates at 0.45. The same
holds for GH1, which saturates at 0.38. For $r = 25$ units, however, the performance of AO* and that of
GH2 resemble one another very closely, while GH1 lags considerably. It
appears that the degree of suboptimality of GH2 relative to AO* decreases as the actual value of $P(x_N = 11)$
increases. This may be attributed to the HEF being able to closely follow the actual value-to-go. For all
practical purposes, GH2 is superior to GH1.

Figure 8. $P(x_N = n)$ as a Function of $d$ for Various $r$ for GH1 and GH2, Respectively

Analogous observations can be made on the impact of $d$; see Figure 9. As the number of committed
resources $r$ increases steadily from 1 to 50 units, a smaller duration length $d$ saturates $P(x_N = 11)$ at a
lower value. For example, if the system can only afford $d = 7$ time units, the system need not expend
beyond $r = 10$ units of resource. The small value of $d$ limits the probability of success to about 0.17,
at best. One can also observe that $d = 35$ time units is about the optimal value for the system, with a
maximum probability value of 0.93. Even if the system has more time, say $d = 50$ time units, the
probability is about the same as that for $d = 35$ time units.


Figure 9. $P(x_N = n)$ as a Function of $r$ for Various Values of $d$ for Greedy Heuristic 2 (GH2)

It is worth noting that there is an inherent bound on the probability of success for a problem. For
example, even if the system has unlimited $r$ and $d$ (e.g., beyond (100,100)), $P(x_N = 11)$ remains at
0.93. That is, in our example, the problem itself has an inherent failure probability of around 0.07
due to execution failures and random state shifts during the execution of control functions.

4.  Conclusions  and  Future  Work 

This paper adopted a Markov decision process (MDP) framework as the underlying model for the
problem, and introduced an explicit connection between traditional planning routines (in particular,
AND/OR graph representations, which exclude uncertainties) and the MDP-based approach. The
objective of the MDP is to devise courses of action (plans or policies) with a high probability of
success. In the future, we will augment the approach to include forbidden system states in our problem
formulation. That is, we seek to find control strategies that guarantee a high probability of success
in reaching the desired goal states while avoiding the forbidden goal states.

The paper introduced problem-specific HEFs into the search algorithms to address the computational
adequacy of MDPs as a planning model. In particular, the approach exploited the goal structure to
significantly reduce the state space. The integration of the new HEFs and heuristic search has enabled
us to find verifiably optimal solutions to problems that are intractable with the DP approach. It
appears that the greedy heuristic GH2 is the preferred algorithm for large-size problems.

It is evident that the challenge is to find even better HEFs for the problem. In [16], it is suggested that
the most useful HEFs should be able to solve at least a special case of the problem at hand, be
computationally efficient, and take into account the problem domain specifics. For example, an
admissible HEF may be derived by assuming unlimited $r$ and $d$. In addition, the AO*-based greedy
search (GH2), which yields promising results, can be further improved using rollout strategies [3],
[19].

Appendix

i   m   p_i,1   p_i,2   p_i,3   p_i,4   p_i,5   p_i,6   p_i,7   p_i,8   p_i,9   p_i,10  p_i,11
1   1   0.1809  0.1459  0.1254  0.1109  0.1049  0.0904  0.0904  0.0812  0       0       0.07
1   2   0.2434  0.1819  0.1407  0.1115  0.1115  0.0704  0.0704  0.0704  0       0       0
1   3   0.2021  0.1592  0.1292  0.1189  0.1068  0.092   0.0729  0.0729  0       0       0.046
1   4   0.189   0.1488  0.1275  0.1129  0.1039  0.0934  0.0804  0.0804  0       0       0.0637
1   5   0.1776  0.1456  0.1262  0.1147  0.1051  0.0931  0.0857  0.0857  0       0       0.0663
1   6   0.1744  0.1428  0.1263  0.1132  0.1049  0.0947  0.0886  0.0816  0       0       0.0733
1   7   0.189   0.1488  0.1275  0.1129  0.1039  0.0934  0.0804  0.0804  0       0       0.0637
1   8   0.171   0.1415  0.1251  0.1121  0.1055  0.0978  0.0883  0.0826  0       0       0.0761
1   9   0.1776  0.1456  0.1262  0.1147  0.1051  0.0931  0.0857  0.0857  0       0       0.0663
2   1   0       0.3164  0       0.2194  0       0.1836  0       0.1582  0       0       0.1224
2   2   0       0.3825  0       0.2211  0       0.1752  0       0.1106  0       0       0.1106
2   3   0       0.3461  0       0.2212  0       0.183   0       0.1249  0       0       0.1249
2   4   0       0.3348  0       0.2258  0       0.1841  0       0.1424  0       0       0.1129
2   5   0       0.3107  0       0.2208  0       0.1839  0       0.1499  0       0       0.1347
2   6   0       0.3073  0       0.2225  0       0.1848  0       0.1562  0       0       0.1292
2   7   0       0.3348  0       0.2258  0       0.1841  0       0.1424  0       0       0.1129
2   8   0       0.3021  0       0.221   0       0.1865  0       0.156   0       0       0.1345
2   9   0       0.3107  0       0.2208  0       0.1839  0       0.1499  0       0       0.1347
10  1   0       0       0       0       0       0       0       0       0       0.5535  0.4465
10  2   0       0       0       0       0       0       0       0       0       0.5723  0.4277
10  3   0       0       0       0       0       0       0       0       0       0.5594  0.4406
10  4   0       0       0       0       0       0       0       0       0       0.5595  0.4405
10  5   0       0       0       0       0       0       0       0       0       0.5495  0.4505
10  6   0       0       0       0       0       0       0       0       0       0.5498  0.4502
10  7   0       0       0       0       0       0       0       0       0       0.5595  0.4405
10  8   0       0       0       0       0       0       0       0       0       0.5471  0.4529
10  9   0       0       0       0       0       0       0       0       0       0.5495  0.4505

Table III. Transition probabilities $p_{ij}(f_m)$ for $i = 1$, 2, and 10 for each $f_m \in F(i)$
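
Since each row of Table III is a probability distribution over successor goal states, its entries should sum to one up to rounding. The Python sketch below checks this for a few rows copied from the table; the helper name `is_stochastic` and the tolerance are our own choices.

```python
# A few rows of Table III, keyed by (i, m); each list holds p_i,1 ... p_i,11.
rows = {
    (1, 1): [0.1809, 0.1459, 0.1254, 0.1109, 0.1049, 0.0904, 0.0904, 0.0812, 0, 0, 0.07],
    (1, 2): [0.2434, 0.1819, 0.1407, 0.1115, 0.1115, 0.0704, 0.0704, 0.0704, 0, 0, 0],
    (2, 1): [0, 0.3164, 0, 0.2194, 0, 0.1836, 0, 0.1582, 0, 0, 0.1224],
    (10, 1): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5535, 0.4465],
}


def is_stochastic(row, tol=1e-3):
    """True if all entries lie in [0, 1] and the row sums to 1 within tol."""
    return all(0.0 <= p <= 1.0 for p in row) and abs(sum(row) - 1.0) <= tol


for key, row in rows.items():
    print(key, round(sum(row), 4), is_stochastic(row))
```

The tolerance of 0.001 allows for the four-decimal rounding of the tabulated values.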

Table IV. Available Control Functions and State Transitions at Each Goal State